K-means Clustering

K-means is arguably the most popular clustering method. The algorithm was introduced decades ago, and many refinements have been proposed since.

The algorithm forms groups by minimizing the distance between each observation and the centre of its assigned cluster. Because the objective is non-convex, it converges to a locally optimal solution. Distances are computed from the coordinates of the observations.
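The core iteration (Lloyd's algorithm) alternates between assigning points to their nearest centre and recomputing each centre as the mean of its points. A minimal sketch in R, assuming a numeric matrix `X` and an initial k x p matrix `centers` (illustration only; the optimized `kmeans()` function is used below):

```r
# One Lloyd iteration: assign points, then update centres.
# Assumes no cluster ends up empty (kmeans() handles that case properly).
lloyd_step <- function(X, centers) {
    k <- nrow(centers)
    # Euclidean distance from every observation to every centre
    d <- as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k, drop = FALSE]
    assignment <- apply(d, 1, which.min)
    # each new centre is the mean of its assigned observations
    new_centers <- t(sapply(seq_len(k), function(j)
        colMeans(X[assignment == j, , drop = FALSE])))
    list(cluster = assignment, centers = new_centers)
}
```

Repeating `lloyd_step()` until the assignments stop changing reproduces the basic k-means fit.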

Importing data

We will use the Prices of Personal Computers dataset for the clustering analysis. This dataset contains 6259 observations and 10 features; it records the prices of 486 personal computers in the US from 1993 to 1995. The variables include price, speed, ram, screen, and cd, among others.

library(dplyr)
D <- read.csv("Datasets/Computers.csv", header = TRUE)
df <- D %>% select(-c(X, cd, multi, premium))
head(df)
   price speed  hd ram screen ads trend
 1  1499    25  80   4     14  94     1
 2  1795    33  85   2     14  94     1
 3  1595    25 170   4     15  94     1
 4  1849    25 170   8     14  94     1
 5  3295    33 340  16     14  94     1
 6  3695    66 340  16     14  94     1
rescale_df <- df %>%
    mutate(price_scal = scale(price),
           hd_scal = scale(hd),
           ram_scal = scale(ram),
           screen_scal = scale(screen),
           ads_scal = scale(ads),
           trend_scal = scale(trend)) %>%
    select(-c(price, speed, hd, ram, screen, ads, trend))
# note: speed is dropped without being rescaled, so the clustering
# uses the six *_scal variables only

You rescale the variables with the scale() function (from base R) inside a dplyr pipeline. Standardization reduces the impact of outliers and lets you compare a single observation against the mean: each value becomes a z-score, the number of standard deviations the observation lies from the variable's mean. A z-score of two, for instance, indicates the value is two standard deviations above the mean. If the underlying variable is roughly Gaussian, the z-scores are symmetric around zero.
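A quick check, using the `price` column of `df` defined above, that `scale()` is exactly the z-score transformation:

```r
x <- df$price
z <- as.numeric(scale(x))
# scale() computes (x - mean(x)) / sd(x) column by column
all.equal(z, (x - mean(x)) / sd(x))  # TRUE
c(mean(z), sd(z))                    # approximately 0, exactly 1
```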

# k-means with k = 5; the result depends on the random initialization,
# so call set.seed() first for reproducible clusters
km <- kmeans(rescale_df, 5)
# matrix of cluster centres
km$centers
   price_scal    hd_scal   ram_scal screen_scal   ads_scal trend_scal
 1 -0.5464214 -0.6352703 -0.6320622 -0.45584738  0.6881761 -0.3382442
 2 -0.7586644  0.2832692 -0.3204174  0.08806206 -0.8936123  1.2198712
 3  0.2679166 -0.7060987 -0.3074995 -0.24669894 -1.3212297 -1.5498077
 4  0.8634498  2.0584350  2.0734266  0.63708443 -0.9224672  1.2199320
 5  1.0242599  0.1953981  0.5121237  0.49673680  0.6635108 -0.3172871
# A vector of integers (from 1:k) indicating the cluster to which each point is allocated.
km$cluster[1:30]
  [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
# The total sum of squares.
km$totss
 [1] 37548
# Vector of within-cluster sum of squares, one component per cluster.
km$withinss
 [1] 3178.529 3307.672 1492.700 2374.915 5114.641
# Total within-cluster sum of squares, i.e. sum(withinss).
km$tot.withinss
 [1] 15468.46
# The between-cluster sum of squares, i.e. totss-tot.withinss.
km$betweenss
 [1] 22079.54
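These components satisfy the decomposition totss = tot.withinss + betweenss, so the ratio betweenss / totss can be read as the share of the total variance captured by the clustering:

```r
# sanity check of the sum-of-squares decomposition
all.equal(km$totss, km$tot.withinss + km$betweenss)  # TRUE
# proportion of variance "explained" by the 5 clusters
km$betweenss / km$totss
```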

Optimal k

One technique for choosing the best k is the elbow method, which evaluates how within-group homogeneity changes as k grows. In other words, you are interested in the share of the variance explained by the clustering. Within-cluster homogeneity increases with the number of clusters; equivalently, the total within-cluster sum of squares decreases. The challenge is to find the k beyond which the returns diminish: past that point, adding a new cluster barely improves the fit because little variability is left to explain.

# kmeans() starts from random centres, so call set.seed() beforehand
# to make these results reproducible
kmean_withinss <- function(k) {
    cluster <- kmeans(rescale_df, k)
    return(cluster$tot.withinss)
}
kmean_withinss(2)
 [1] 27087.07
# Set the maximum number of clusters
max_k <- 20
# Run the algorithm over a range of k
wss <- sapply(2:max_k, kmean_withinss)
# Create a data frame to plot the graph
elbow <- data.frame(k = 2:max_k, wss)
library(plotly)
p <- elbow %>% ggplot(aes(x = k, y = wss)) +
  geom_line() +
  geom_point(size = 3, color = "mediumvioletred")

p <- ggplotly(p)
p
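Reading the elbow off the plot is subjective. One simple heuristic, using the `wss` and `elbow` objects computed above, is to pick the first k after which the relative improvement falls below a threshold (the 5% here is an arbitrary choice):

```r
# relative drop in tot.withinss when moving from k to k + 1
rel_drop <- -diff(wss) / head(wss, -1)
# first k whose next cluster improves the fit by less than 5%
k_elbow <- elbow$k[which(rel_drop < 0.05)[1]]
k_elbow
```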

Examining the clusters

# refit with the chosen number of clusters (here, 7)
km_2 <- kmeans(rescale_df, 7)
km_2$size
 [1]  228  448  539  910  458 1554 2122
center <- km_2$centers

Visualize

You can create a heat map to highlight the differences between the clusters.

library(plotly)
library(reshape2)
# melt the k x p matrix of centres into long format for plotting
center_df <- center %>%
  melt(varnames = c("Cluster", "Attributes")) %>%
  ggplot(aes(Attributes, Cluster, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "navy", mid = "white", high = "mediumvioletred")
center_df <- ggplotly(center_df)
center_df

K-medoid Clustering
K-medoids chooses actual observations (medoids) as cluster centres, which makes it more robust to outliers than k-means and lets it run on any dissimilarity matrix. The pam() function from the cluster package implements it, and the Gower distance accommodates the mix of numeric and categorical variables in the credit dataset.

library(cluster)
library(dplyr)
library(reshape2)
library(plotly)
credit <- read.csv("Datasets/credit.csv", header = TRUE)
# treat the categorical variables as factors so the Gower
# distance handles them correctly
credit$Income <- as.factor(credit$Income)
credit$Credit_cards <- as.factor(credit$Credit_cards)
credit$Education <- as.factor(credit$Education)
credit$Car_loans <- as.factor(credit$Car_loans)
D <- credit %>% select(-Credit_rating)
# Gower distance works on mixed numeric and categorical variables
D_mat <- daisy(D, metric = "gower")
kmed <- pam(D_mat, 2, diss = TRUE)
# with a dissimilarity input, the medoids are reported as row indices
kmed$medoids
 [1]  373 1947
credit[kmed$medoids,]
      Credit_rating      Age Income Credit_cards Education Car_loans
 373              0 31.92189      2            2         2         2
 1947             1 34.31873      2            1         1         1
# cross-tabulate the clusters against the observed credit rating
M <- data.frame(table(kmed$clustering, credit$Credit_rating))
names(M)[1:2] <- c("cluster", "Credit_rating")
M
   cluster Credit_rating Freq
 1       1             0  888
 2       2             0  132
 3       1             1  745
 4       2             1  699
p2 <- M %>%
  ggplot(aes(x = cluster, y = Freq, fill = Credit_rating)) +
  geom_bar(stat = "identity", position = position_dodge2())

p2 <- ggplotly(p2)
p2
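To check that k = 2 is a sensible choice for the medoid clustering, you can compute the average silhouette width from the same Gower dissimilarity matrix (values near 1 indicate well-separated clusters):

```r
# silhouette widths from the cluster package, reusing D_mat
sil <- silhouette(kmed$clustering, D_mat)
summary(sil)$avg.width
```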